Emphatic TD Bellman Operator is a Contraction
Authors
Abstract
Recently, Sutton et al. (2015) introduced the emphatic temporal differences (ETD) algorithm for off-policy evaluation in Markov decision processes. In this short note, we show that the projected fixed-point equation that underlies ETD involves a contraction operator with a √γ contraction modulus (where γ is the discount factor). This allows us to provide error bounds on the approximation error of ETD. To our knowledge, these are the first error bounds for an off-policy evaluation algorithm under general target and behavior policies.
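As a concrete reading of this result: once the projected Bellman operator underlying ETD, written Π T^π below, is known to be a √γ-contraction in a suitable weighted norm ‖·‖ (the weighting induced by the emphatic state distribution; the notation here is illustrative rather than the paper's), the standard fixed-point argument gives an approximation-error bound of the following form:

\[
\|\Pi T^{\pi} v - \Pi T^{\pi} u\| \le \sqrt{\gamma}\,\|v - u\|
\quad\Longrightarrow\quad
\|v_{\mathrm{ETD}} - v^{\pi}\| \le \frac{1}{1-\sqrt{\gamma}}\,\|\Pi v^{\pi} - v^{\pi}\|,
\]

where v^π is the true value function of the target policy, Π is the projection onto the span of the features, and v_ETD is the unique fixed point of Π T^π. The implication follows from the triangle inequality ‖v_ETD − v^π‖ ≤ ‖Π T^π v_ETD − Π T^π v^π‖ + ‖Π v^π − v^π‖ together with T^π v^π = v^π.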
Similar resources
True Online Emphatic TD(λ): Quick Reference and Implementation Guide
TD(λ) is the core temporal-difference algorithm for learning general state-value functions (Sutton 1988, Singh & Sutton 1996). True online TD(λ) is an improved version incorporating dutch traces (van Seijen & Sutton 2014, van Seijen, Mahmood, Pilarski & Sutton 2015). Emphatic TD(λ) is another variant that includes an “emphasis algorithm” that makes it sound for off-policy learning (Sutton, Mahm...
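For reference, below is a minimal sketch (in Python with NumPy, assuming linear function approximation) of the plain emphatic TD(λ) update described by Sutton, Mahmood & White; the true-online refinement of van Seijen & Sutton is omitted. The function name and argument layout are mine, not taken from any of the cited papers.

import numpy as np

def etd_lambda_step(theta, e, F, phi, phi_next, reward,
                    rho, rho_prev, interest, alpha, gamma, lam):
    """One emphatic TD(lambda) step; the value estimate is v(s) ~ theta @ phi(s).

    rho      -- importance ratio pi(A_t|S_t) / mu(A_t|S_t) for the action taken
    rho_prev -- previous step's importance ratio (use 1.0 on the first step)
    interest -- interest i_t assigned to the current state
    Returns the updated (theta, e, F).
    """
    # TD error of the observed transition.
    delta = reward + gamma * theta @ phi_next - theta @ phi
    # Follow-on trace F_t and emphasis M_t (the "emphasis algorithm").
    F = rho_prev * gamma * F + interest
    M = lam * interest + (1.0 - lam) * F
    # Emphasis-weighted eligibility trace and weight update.
    e = rho * (gamma * lam * e + M * phi)
    theta = theta + alpha * delta * e
    return theta, e, F

# Example: one synthetic transition in a 3-feature problem.
theta, e, F = np.zeros(3), np.zeros(3), 0.0
phi, phi_next = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
theta, e, F = etd_lambda_step(theta, e, F, phi, phi_next, reward=1.0,
                              rho=1.2, rho_prev=1.0, interest=1.0,
                              alpha=0.1, gamma=0.9, lam=0.8)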
True Online Emphatic TD($\lambda$): Quick Reference and Implementation Guide
This document is a guide to the implementation of true online emphatic TD(λ), a model-free temporal-difference algorithm for learning to make long-term predictions which combines the emphasis idea (Sutton, Mahmood & White 2015) and the true-online idea (van Seijen & Sutton 2014). The setting used here includes linear function approximation, the possibility of off-policy training, and all the ge...
O2TD: (Near)-Optimal Off-Policy TD Learning
Temporal difference learning and Residual Gradient methods are the most widely used temporal-difference-based learning algorithms; however, it has been shown that none of their objective functions are optimal w.r.t. approximating the true value function V. Two novel algorithms are proposed to approximate the true value function V. This paper makes the following contributions: • A batch algorit...
The Curvature of a Single Contraction Operator on a Hilbert Space
This note studies Arveson's curvature invariant for d-contractions T = (T1, T2, ..., Td) for the special case d = 1, referring to a single contraction operator T on a Hilbert space. It establishes a formula which gives an easy-to-understand meaning for the curvature of a single contraction. The formula is applied to give an example of an operator with nonintegral curvature. Under the additio...
Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis
We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton, Mahmood, and White, 2015), which encompasses the original ETD(λ), as well as several other off-policy evaluation algorithms as special cases. We call this framework ETD(λ, β), where o...
Journal: CoRR
Volume: abs/1508.03411
Publication year: 2015